Abstract

Conditional Random Fields (CRF) have been widely applied in image segmentation. While most studies rely on hand-crafted
features, we here propose to exploit a pre-trained large convolutional neural network (CNN) to generate deep
features for CRF learning. The deep CNN is trained on the ImageNet dataset and transferred to image segmentation
here for constructing potentials of superpixels. Then the CRF parameters are learnt using a structured support vector 
machine (SSVM). To fully exploit context information in inference, we construct spatially related co-occurrence 
pairwise potentials and incorporate them into the energy function. This prefers labelling of object pairs that frequently 
co-occur in a certain spatial layout and at the same time avoids implausible labellings during the inference. Extensive 
experiments on binary and multi-class segmentation benchmarks demonstrate the promise of the proposed method. We 
thus provide new baselines for the segmentation performance on the Weizmann horse, Graz-02, MSRC-21, Stanford 
Background and PASCAL VOC 2011 datasets. 

Keywords: Conditional random field (CRF), Convolutional neural network (CNN), Structured support vector 
machine (SSVM), Co-occurrence 


1. Introduction 

The task of image segmentation is to produce a pixel-level labelling of different object categories, with a wide
variety of applications ranging from image retrieval to object recognition. It is challenging because objects may appear
against various backgrounds and under different visual conditions. CRFs [24] model the conditional distribution of labels given
observations, and represent the state of the art in image/object segmentation [38, 37, 12, 30, 33]. In [38], Szummer et
al. proposed to learn the coefficients of CRF potentials using structured support vector machines (SSVM) and graph
cuts. Since then, SSVM has been widely applied for CRF learning in segmentation tasks.

In the pipeline of CRF learning based image segmentation, finding a good feature representation is of great
significance and can have a profound impact on the segmentation accuracy. Most previous studies rely on hand-crafted
features, e.g., using color histograms, HOG or SIFT descriptors to construct bag-of-words features.
Recently, feature learning and especially deep learning methods have gained great popularity in machine learning and
related fields. Methods of this type typically take raw images as input and learn a (deep) representation of the images,
and have found phenomenal success in various tasks such as speech recognition, image classification and object
detection. See Bengio et al. for a detailed review. Deep learning methods attempt to model high-level
abstractions in data at multiple layers, inspired by the cognitive processes of the human brain, which generally proceed
from simpler concepts to more abstract ones. The learning is achieved by using deep architectures, e.g., deep belief
networks (DBNs), stacked autoassociator networks and deep convolutional neural networks (CNNs) [25, 20, 28].
Among them, CNNs are high-capacity machine learning models with a very large number of (typically a few
million) parameters that are optimized from labelled training examples. The success of CNNs in various vision tasks
is mainly due to their ability to learn rich mid-level features that accommodate within-class variance and at
the same time possess discriminative information. This is in contrast to low-level hand-crafted features.

On the other hand, prior work [14, 23, 35] has demonstrated that holistic reasoning about the occurrences of all
classes helps to improve segmentation performance. This rests on the observation that neighbouring image
regions tend to be occupied by frequently co-occurring objects, while mutually exclusive object pairs are less likely to
appear together. For example, a cow is more likely to show up together with grass than with a monitor, and grass
is less likely to appear above sky. Therefore, we here propose to construct spatially related co-occurrence pairwise
potentials to exploit the context information during inference.

In summary, we highlight the main contributions of this work as follows. 

• We show that cross-domain image features learned by CNNs with labelled data from ImageNet (see footnote 1) can be
successfully transferred for segmentation purposes. By thoroughly evaluating the performance of the CNN features
of different depths and comparing with the traditional bag-of-words and unsupervised feature learning methods, 
we demonstrate the power of CNN features in image segmentation. 

• We illustrate that SSVM based CRF learning with CNN features yields strong results and thus provide new
baselines for segmentation performance on the Weizmann horse, Graz-02, MSRC-21, Stanford Background and
PASCAL VOC 2011 datasets.

• We incorporate spatially related co-occurrence pairwise potentials into the inference and gain a further
performance boost.

2. Related work 

We briefly review some work that is relevant to ours. In the first work on using convolutional networks for scene
parsing, a deep CNN is trained with a supervised greedy learning strategy, taking pixels as input
to yield a pixel-wise labelling of an image. While somewhat preliminary, it achieved a marginal improvement over
CRF learning based segmentation methods. We show in this paper that deep CNN features transferred from ImageNet
(ImageNet is an image dataset organized according to the WordNet hierarchy, containing millions of labelled images)
combined with SSVM based CRF learning outperform most state-of-the-art methods. Schulz et al. [36] propose to
predict the segmentation mask by adding a pairwise class location filter to the conventional CNN architecture of [25].
In the work of [10], the authors use a multiscale convolutional network trained from raw pixels to extract dense feature
vectors that encode regions of multiple sizes centered on each pixel, and present impressive results on several datasets.
Our work differs from [10] in two aspects. First, we transfer a deep CNN trained on the ImageNet [20] dataset to
segmentation, while [10] trains a 3-stage convolutional network [25] on the training data of the segmentation
dataset at hand, and we demonstrate experimentally that better performance can be achieved by our method. Secondly, our
method uses SSVM to learn CRF potentials, while no such learning is involved in [10]. Figure 1 shows a sketch of our
segmentation pipeline.

Most recently, Girshick et al. [13] demonstrated that a deep CNN trained on ImageNet can be successfully
transferred to object detection, achieving a great performance boost on the PASCAL VOC 2012 dataset. As an extension
of their statement, they also conduct a scene labelling experiment on the PASCAL VOC segmentation dataset to
validate the power of deep CNN features on the segmentation task. Our work is mainly inspired by theirs, but differs
in that we combine deep CNN features with SSVM based CRF learning, in contrast to their region proposals and
support vector regression based method. Furthermore, we thoroughly evaluate the performance of deep CNN features
compared to bag-of-words features and unsupervised learned features, and provide new baselines for labelling
performance on various segmentation benchmarks.

Co-occurrence statistics have been exploited and have demonstrated their strength in the community. In [14], the authors
incorporate semantic object context as a post-processing step by considering the co-occurrence counts of label pairs.
Ladicky et al. [23] explore inference methods for CRFs with co-occurrence statistics by considering a class of
global potentials. Different from their methods, which ignore the spatial relations of the co-occurrences, we propose to
construct spatially related co-occurrence pairwise potentials, which favor labellings of object pairs that frequently
co-occur in a certain spatial layout while at the same time preventing unreasonable labellings. Our method is inspired
by [35] but differs in that they incorporate the mutex information by adding a mutex constraint to the inference
problem, while we simply construct co-occurrence pairwise potentials and, most importantly, explore CNN features
combined with SSVM based CRF learning.
Figure 1: An illustration of the proposed segmentation pipeline. We first over-segment the image into superpixels and then compute deep
convolutional features of the patch around each superpixel centroid using a pre-trained deep CNN. The learned features are then used to learn a CRF for
segmentation.


3. Method 

3.1. Segmentation using CRF models 

Given $X = \{x_i\}$, a collection of image instances with corresponding labels $Y = \{y_i\}$, where $i$ indexes images,
CRF [24] considers the log-loss of the overall energy

$$P(y \mid x; w) = \frac{1}{Z} \exp\Big(-\sum_{i} E(y_i, x_i; w)\Big), \tag{1}$$

where $w$ are the parameters and $Z$ is the normalization term. The energy $E$ of an image $x$ with segmentation labels $y$ over
the nodes (superpixels) $\mathcal{N}$ and edges $\mathcal{S}$ takes the following form:

$$E(y, x; w) = \sum_{p \in \mathcal{N}} \Phi^{(1)}(y^p, x; w) + \sum_{(p,q) \in \mathcal{S}} \Phi^{(2)}(y^p, y^q, x; w). \tag{2}$$

Here $x \in \mathcal{X}$, $y \in \mathcal{Y}$; $\Phi^{(1)}$ and $\Phi^{(2)}$ are the unary and pairwise potentials, both of which depend on the observations
as well as the parameter $w$. CRF seeks an optimal labelling that achieves maximum a posteriori (MAP), which mainly
involves a two-step process [38]: 1) learning the model parameters from the training data; 2) inferring the most likely
labels for the test data given the learned parameters. The segmentation problem thus reduces to minimizing the energy
(or cost) over $y$ under the learned parameters $w$, that is:

$$y^* = \operatorname*{argmin}_{y \in \mathcal{Y}} E(y, x; w). \tag{3}$$
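To make the energy in (2) and the MAP problem in (3) concrete, the following is a minimal sketch on a toy graph. It enumerates all labellings, which is only feasible for a handful of superpixels; it is an illustration, not the inference method used in this work. The names `energy`, `map_labelling`, `potts` and the toy costs are assumptions introduced for this example.

```python
from itertools import product


def energy(labels, unary, edges, pairwise):
    """E(y) = sum_p unary[p][y_p] + sum_{(p,q)} pairwise(p, q, y_p, y_q)."""
    total = sum(unary[p][labels[p]] for p in range(len(labels)))
    total += sum(pairwise(p, q, labels[p], labels[q]) for p, q in edges)
    return total


def map_labelling(unary, edges, pairwise, num_labels):
    """Exhaustive argmin over all labellings (toy-sized graphs only)."""
    n = len(unary)
    return min(product(range(num_labels), repeat=n),
               key=lambda y: energy(y, unary, edges, pairwise))


def potts(p, q, y_p, y_q):
    """Hypothetical smoothness term: constant cost when neighbours disagree."""
    return 0.5 if y_p != y_q else 0.0


# Toy example: three superpixels in a chain, two labels.
unary = [[0.0, 2.0], [1.0, 0.9], [2.0, 0.0]]  # unary[p][label]
edges = [(0, 1), (1, 2)]
best = map_labelling(unary, edges, potts, num_labels=2)
best_energy = energy(best, unary, edges, potts)
```

Here the middle superpixel has nearly tied unary costs, so the pairwise term decides where the single label change along the chain is placed.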


3.2. Learning CRF in the large-margin framework 

Applying large-margin based CRF learning amounts to solving the following optimization:

$$\begin{aligned}
\min_{w,\, \xi \geq 0} \quad & \frac{1}{2}\|w\|_2^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i \\
\text{s.t.:} \quad & E(y, x_i; w) - E(y_i, x_i; w) \geq \Delta(y_i, y) - \xi_i, \quad \forall i = 1, \dots, m, \text{ and } \forall y \in \mathcal{Y},
\end{aligned} \tag{4}$$

where $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function associated with the prediction and the true label mask. In general, we have
$\Delta(y, y) = 0$ and $\Delta(y, y') > 0$ for any $y' \neq y$. Intuitively, the optimization in (4) encourages the energy of the
ground truth label $E(y_i, x_i; w)$ to be lower than that of any incorrect labelling $E(y, x_i; w)$ by at least a margin $\Delta(y_i, y)$.
The SSVM solves (4) by iteratively finding the most violated constraint for each example $i$:

$$y_i^* = \operatorname*{argmin}_{y \in \mathcal{Y}} E(y, x_i; w) - \Delta(y_i, y). \tag{5}$$
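The search for the most violated constraint in (5) can be sketched as follows, under stated assumptions: labellings are enumerated exhaustively (toy sizes only), the loss is a plain Hamming loss, and the energy is a hypothetical unary-only function. None of these choices is claimed to match the actual implementation.

```python
from itertools import product


def hamming(y_true, y):
    """Number of positions where the two labellings disagree."""
    return sum(a != b for a, b in zip(y_true, y))


def most_violated(y_true, energy_fn, num_labels, loss=hamming):
    """argmin_y E(y) - loss(y_true, y), found by enumeration."""
    n = len(y_true)
    return min(product(range(num_labels), repeat=n),
               key=lambda y: energy_fn(y) - loss(y_true, y))


def slack(y_true, y, energy_fn, loss=hamming):
    """Amount by which the margin constraint for y is violated (0 if satisfied)."""
    return max(0.0, loss(y_true, y) + energy_fn(y_true) - energy_fn(y))


# Hypothetical unary-only energy over two superpixels and two labels.
unary = [[0.0, 0.5], [0.0, 3.0]]


def toy_energy(y):
    return sum(unary[p][y[p]] for p in range(len(y)))


y_true = (0, 0)
y_hat = most_violated(y_true, toy_energy, num_labels=2)
violation = slack(y_true, y_hat, toy_energy)
```

In this toy case the labelling that flips only the first superpixel is the most violated one: its energy is only 0.5 above the ground truth, while the required margin is 1.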


1 http://image-net.org 
To learn CRF in the large margin framework, we consider energy functions that are linear in the parameter $w$,
which indicates that the unary and the pairwise potentials in (2) can be written as:

$$\Phi^{(1)}(y^p, x; w) = \big\langle w^{(1)}, \phi^{(1)}(y^p, x) \big\rangle, \tag{6}$$

and

$$\Phi^{(2)}(y^p, y^q, x; w) = \big\langle w^{(2)}, \phi^{(2)}(y^p, y^q, x) \big\rangle, \tag{7}$$

where $\phi^{(1)}$, $\phi^{(2)}$ are the unary and pairwise feature mappings respectively and $\langle \cdot, \cdot \rangle$ denotes the inner product. Clearly
we have $w = w^{(1)} \odot w^{(2)}$ ($\odot$ stacks two vectors). We will show how to construct the feature mappings over the
learned deep features in the following.

Implementation details. After obtaining the learned deep features, we define feature mappings upon them to
construct the energy function. Consider the image $x$ with label $y$, let $x^p$ be the feature vector associated with the $p$-th
superpixel, and let $K$ be the number of classes (possible labels). Then we define the unary feature mapping as

$$\phi^{(1)}(y^p, x) = \big[ I(y^p = 1)\, x^{p\top}, \dots, I(y^p = K)\, x^{p\top} \big]^{\top}, \tag{8}$$

where $I(\cdot)$ is an indicator function which equals 1 if the input is true and 0 otherwise. In the multi-class case,
the dimension of $\phi^{(1)}(y^p, x)$ can be too large when $x^p$ is high dimensional. To address this issue, we first train a
one-vs-all multi-class linear SVM over the features of superpixels, and then use the output confidence scores of the
$p$-th superpixel as $x^p$ to construct the unary potential. A similar strategy is used in [12, 30]. Accordingly, the pairwise
feature mapping is constructed as

$$\phi^{(2)}(y^p, y^q, x) = L_{pq} \cdot I(y^p \neq y^q), \tag{9}$$

where $L_{pq}$ can be the shared boundary length or the inverse color difference between neighbouring superpixels.
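The feature mappings in (8) and (9) can be illustrated with a short sketch. It uses plain Python lists and 0-based labels (the text uses labels 1 to K); the helper names `unary_feature`, `pairwise_feature` and `inner`, and the toy weights, are assumptions made for this example.

```python
def unary_feature(label, x_p, num_classes):
    """phi1(y_p, x): x_p copied into the block of the chosen label, zeros elsewhere."""
    d = len(x_p)
    phi = [0.0] * (num_classes * d)
    phi[label * d:(label + 1) * d] = x_p
    return phi


def pairwise_feature(label_p, label_q, l_pq):
    """phi2(y_p, y_q, x): edge strength L_pq if the two labels differ, else 0."""
    return l_pq if label_p != label_q else 0.0


def inner(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x_p = [0.2, 0.8]                         # feature vector of one superpixel
phi = unary_feature(1, x_p, num_classes=3)
w1 = [1.0, 1.0, 2.0, -1.0, 0.0, 0.0]     # hypothetical unary weights
unary_potential = inner(w1, phi)         # only the block of label 1 contributes
```

Because the indicator zeroes all but one block, the inner product with `w1` picks out exactly the weights of the chosen label, which is why the potential stays linear in the parameters.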

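The energy function in (2) can then be written as

$$E(y, x; w) = \Big\langle w^{(1)}, \sum_{p \in \mathcal{N}} \phi^{(1)}(y^p, x) \Big\rangle + \Big\langle w^{(2)}, \sum_{(p,q) \in \mathcal{S}} \phi^{(2)}(y^p, y^q, x) \Big\rangle. \tag{10}$$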

To deal with the unbalanced appearance of different categories in the dataset, we define $\Delta(y_i, y)$ as the weighted
Hamming loss, which weighs errors for a given class inversely proportional to the frequency with which it appears in the training
data, similar to [30]. We use the method of [30] to solve the inference in (5).
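A weighted Hamming loss of this kind can be sketched as below. The exact weighting scheme is not spelled out in the text, so the choice here (weight of a class = total count divided by the count of that class, applied to the ground-truth class of each mislabelled superpixel) is an assumption for illustration only.

```python
from collections import Counter


def class_weights(training_labels):
    """Weight each class by the inverse of its relative frequency."""
    counts = Counter(training_labels)
    total = sum(counts.values())
    return {c: total / n for c, n in counts.items()}


def weighted_hamming(y_true, y_pred, weights):
    """Sum the weight of the true class over all mislabelled positions."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


# Class 0 is three times as frequent as class 1 in this toy training set.
weights = class_weights([0, 0, 0, 1])
loss_all_wrong = weighted_hamming([0, 1], [1, 0], weights)
loss_correct = weighted_hamming([0, 1], [0, 1], weights)
```

With these weights, mislabelling a superpixel of the rare class costs three times as much as mislabelling one of the frequent class.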

3.3. Inference with co-occurrence pairwise potentials 

To fully exploit context information, we consider the frequency of co-occurring object pairs in different spatial
layouts during the inference. On one hand, this prefers labellings of frequently co-occurring label pairs in a certain
spatial relation; on the other hand, it excludes unreasonable labellings of co-occurrences (mutex constraint,
similar to [35]), such as grass, water or road appearing above sky. Different from the mutex constraint used in [35],
we incorporate the co-occurrence constraint into the pairwise term by devising spatially related co-occurrence pairwise
potentials. We consider four spatial relations of the adjacent superpixel pairs: $p$ is above $q$, $p$ is below $q$, $p$ is to the left of $q$
and $p$ is to the right of $q$. Then the feature mapping for the pairwise potential in (10) is written as:

$$\begin{aligned}
\sum_{(p,q) \in \mathcal{S}} \phi^{(2)}(y^p, y^q, x) = {} & \sum_{(p,q) \in \mathcal{S}_1} \phi_1^{(2)}(y^p, y^q, x) + \sum_{(p,q) \in \mathcal{S}_2} \phi_2^{(2)}(y^p, y^q, x) \\
& + \sum_{(p,q) \in \mathcal{S}_3} \phi_3^{(2)}(y^p, y^q, x) + \sum_{(p,q) \in \mathcal{S}_4} \phi_4^{(2)}(y^p, y^q, x),
\end{aligned} \tag{11}$$

where $\mathcal{S}_1, \mathcal{S}_2, \mathcal{S}_3, \mathcal{S}_4$ are the sets of edges where $p$ and $q$ are in the spatial relations "above", "below", "left" and "right"
respectively, with $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \mathcal{S}_3 \cup \mathcal{S}_4$ and $\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ for $i \neq j$, $i, j = 1, 2, 3, 4$.
To construct the co-occurrence pairwise potentials, we assume that the training data is sufficiently large. The
pairwise potentials in (11) can then be written as:

$$\phi_i^{(2)}(y^p, y^q, x) = L_{pq} \cdot I(y^p \neq y^q) \cdot g_i(y^p, y^q), \quad i = 1, 2, 3, 4, \tag{12}$$

where $g_i(y^p, y^q) = \dfrac{1}{f^i_{\text{co-occur}}(y^p, y^q)}$ with $f^i_{\text{co-occur}}(y^p, y^q) = \dfrac{N^i_{pq}}{N_{pq}}$. Here, $N_{pq}$ is the number of training images in
which $y^p$ and $y^q$ co-exist, and $N^i_{pq}$ ($i = 1, 2, 3, 4$) are the numbers of training images in which $y^p$ and $y^q$ appear in
the four spatially related neighbouring superpixels respectively. If $N^i_{pq} = 0$, meaning that $y^p$ and $y^q$ never appear in
the $i$-th spatial relation, then $g_i(y^p, y^q) = \infty$, preventing the inference from yielding such pair labellings. Intuitively, this
prefers labellings that frequently co-occur in certain spatial relations in the training data, and avoids
mutually exclusive labellings, such as grass appearing above sky.
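As a rough illustration of the co-occurrence factor described above, the sketch below turns image counts into a multiplier on the pairwise cost. It assumes the factor is the inverse of the relative frequency of a label pair in a given spatial relation (the count in that relation divided by the count of co-existence), and that a never-observed relation maps to an infinite cost; the function names and the counts are hypothetical.

```python
import math


def cooccurrence_factor(n_rel, n_total):
    """Inverse relative frequency of a label pair in one spatial relation."""
    if n_rel == 0:
        return math.inf  # relation never observed in training: forbid it
    return n_total / n_rel


def cooccurrence_potential(label_p, label_q, l_pq, n_rel, n_total):
    """Pairwise cost for an edge: 0 for equal labels, else L_pq scaled by the factor."""
    if label_p == label_q:
        return 0.0
    g = cooccurrence_factor(n_rel, n_total)
    # Guard against 0 * inf (NaN) when the edge strength happens to be zero.
    return math.inf if math.isinf(g) else l_pq * g


# Hypothetical counts: "sky above grass" seen in 40 of 50 images where both
# classes co-exist, "grass above sky" never seen.
cost_plausible = cooccurrence_potential("sky", "grass", 2.0, n_rel=40, n_total=50)
cost_forbidden = cooccurrence_potential("grass", "sky", 2.0, n_rel=0, n_total=50)
```

Frequent layouts therefore receive a factor close to 1, rare ones are penalised more strongly, and layouts never seen in training are ruled out entirely.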

Note that the mutex constraint used in [35] can be seen as a special case of our co-occurrence pairwise
potentials, as it is equivalent to ours when we set $g_i(y^p, y^q) = \infty$ for $f^i_{\text{co-occur}}(y^p, y^q) = 0$ and $g_i(y^p, y^q) = 1$ for
$f^i_{\text{co-occur}}(y^p, y^q) \neq 0$. We will provide an experimental comparison with this case in Section 4.3. After learning the
CRF using SSVM, we construct co-occurrence pairwise potentials for prediction. We add a trade-off parameter $\alpha$
multiplied to the pairwise term and tune it from 0.5 to 2 based on validation sets.


4. Experiments 


To demonstrate the effectiveness of the proposed method, we first compare the CNN features with the traditional
bag-of-words feature and an unsupervised feature learning method [7], and evaluate the impact of depth on the
performance of the CNN features in Section 4.2. We then compare with state-of-the-art methods on several image
segmentation datasets in Section 4.3.


4.1. Experimental setup 

For the CNN features, we use the model trained on ImageNet provided by Caffe. The network follows the
well-known AlexNet [20], and is composed of 5 convolutional layers and 2 fully connected layers together with a soft-max
layer.

We evaluate the performance of the proposed method on the Weizmann horse, Graz-02, MSRC-21, Stanford
Background and PASCAL VOC 2011 segmentation challenge datasets. The Weizmann horse dataset (see footnote 2) consists of 328 horse
images from various backgrounds, with ground truth masks available for each image. We use the same data split as in
previous work, and we simply resize the images to 256 × 256. The Graz-02 dataset (see footnote 3) contains 3 categories (bike, car and
people). This dataset is considered challenging as the objects appear against various backgrounds and in different poses.
We follow the established evaluation protocol and use 150 images for training and 150 for testing for each category.

The MSRC-21 dataset [37] is a popular multi-class segmentation benchmark with 591 images containing objects
from 21 categories. We follow the standard split to divide the dataset into training/validation/test subsets. The
Stanford Background dataset [15] is a collection of outdoor scene images from several publicly available datasets, which
consists of 715 images covering 8 categories. Each image is approximately 320 × 240 pixels and contains at
least one foreground object. We use the same evaluation protocol as in [15] to report 5-fold cross validation accuracy
(global and per-category). The VOC 2011 dataset consists of images from 20 object classes and background. We train on
the training set and test on the validation images. Performance is quantified by the standard VOC measure [9].

We start by over-segmenting the images into superpixels using SLIC (~700 superpixels per image) and then
compute features within regions around each superpixel centroid with different block sizes (36 × 36, 48 × 48, 64 × 64,
72 × 72). We construct four types of pairwise features, also using different block sizes, to enforce spatial smoothness:
color difference in LUV space, color histogram difference, texture difference in terms of LBP operators,
and shared boundary length. Training our model on the MSRC-21 dataset takes around 2 hours. During
prediction, the inference is rather efficient (less than 1 second per image).


2 http://www.msri.org/people/members/eranb/
3 http://www.emt.tugraz.at/~pinz/
Metric |              SVM               |              SSVM
       | BoW    UFL    L5     L6     L7   | BoW    UFL    L5     L6     L7
-------+----------------------------------+----------------------------------
S_a    | 87.5   89.3   90.1   92.7   91.1 | 92.3   94.6   95.2   95.7   95.1
S_o    | 58.7   63.6   68.9   74.6   72.9 | 72.5   80.1   82.4   84.0   82.3

Table 1: Performance of different methods on the Weizmann horse dataset. CNN features perform significantly better than the traditional BoW 
feature and the unsupervised feature learning method, with features of the 6th layer performing marginally better than other compared layers. SSVM 
based CRF learning performs far better than SVM. 


4.2. Baseline Comparison 

To show the superiority of the deep CNN features, we compare with the traditional
bag-of-words (BoW) feature and features learned from a popular unsupervised feature learning method [7].
Specifically, we first extract dense SIFT descriptors within each superpixel block and then quantize them into a BoW feature
using nearest neighbour search with a codebook size of 400. For the unsupervised feature learning, we first learn a
dictionary of size 400 with patch size 6 × 6 on the evaluated image dataset using k-means, and then use
soft threshold coding to encode patches extracted from each superpixel block. The final feature vectors are obtained
by performing a three-level max pooling over the superpixel block.
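The BoW quantization step described above can be sketched in a few lines. This is a toy version with a three-word codebook instead of 400 codewords, and the final L1 normalization of the histogram is an assumption not stated in the text; `nearest_codeword` and `bow_histogram` are names introduced for this example.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(descriptor, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def bow_histogram(descriptors, codebook):
    """Count nearest-codeword assignments, then L1-normalize (assumed here)."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy data: 2-D "descriptors" of one superpixel block and a 3-word codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.8, 0.9], [0.1, 0.9]]
hist = bow_histogram(descriptors, codebook)
```

The resulting histogram has one entry per codeword, so its dimension equals the codebook size regardless of how many descriptors fall inside the block.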

To investigate the roles of different layers in the proposed segmentation method, we evaluate the performance of
features from the last three layers of the CNN model (5th, 6th and 7th layers). The 5th layer (with dimension 9216)
is the last convolutional layer of the CNN. The 6th layer (with dimension 4096) is a fully connected layer that follows the
5th layer, and the 7th (with dimension 4096) is the final layer of the feature learning pipeline. Using the two types
of learned features, we compare the SSVM based CRF learning with a baseline method, namely a linear SVM, which
classifies each superpixel independently without CRF learning. The datasets used in this section are Weizmann horse,
Graz-02 and MSRC-21. We use BoW to denote the bag-of-words feature and UFL to denote the unsupervised feature
learning method; L5, L6 and L7 are the CNN features of the 5th, 6th and 7th layers respectively.

Weizmann horse. We first test on the Weizmann horse dataset. Performance is quantified by the global pixel-wise
accuracy $S_a$ and the foreground intersection-over-union score $S_o$. $S_a$ measures the percentage of
pixels correctly classified, while $S_o$ directly reflects the segmentation quality of the foreground. The compared results
are reported in Table 1. We can observe that the CNN features perform consistently better than the bag-of-words
feature and the unsupervised learned feature with both SVM and SSVM. By enforcing a smoothness term, SSVM based CRF
learning obtains far better segmentations than a simple binary model such as SVM. Furthermore, features of different depths
exhibit similar performance, with the 6th layer performing marginally better than the other compared layers with
both SVM and SSVM. In Figure 2 we show some examples of qualitative evaluation, which yield conclusions that
are in accordance with those from Table 1.
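For reference, the two scores used here can be computed from flattened binary masks as in the following sketch. The function names and the convention of returning 1.0 when both masks contain no foreground are assumptions for this example.

```python
def pixel_accuracy(pred, truth):
    """S_a: fraction of pixels whose predicted label matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def foreground_iou(pred, truth, fg=1):
    """S_o: intersection over union of the foreground regions."""
    inter = sum(p == fg and t == fg for p, t in zip(pred, truth))
    union = sum(p == fg or t == fg for p, t in zip(pred, truth))
    return inter / union if union else 1.0  # assumed convention for empty masks


# Flattened 2x2 masks: 1 = horse (foreground), 0 = background.
pred = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
s_a = pixel_accuracy(pred, truth)
s_o = foreground_iou(pred, truth)
```

The example shows why both numbers are reported: a single false-positive pixel costs 25 points of accuracy here but halves the foreground overlap.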

Graz-02. For a comprehensive evaluation, we use two measurements to quantify the performance of our method
on the Graz-02 dataset: the intersection-over-union score and the pixel accuracy (including foreground and
background). We report the results in Table 2. It can be observed that feature learning methods generally outperform
the traditional bag-of-words feature, with CNN features performing best. As for different depths, features of the
6th layer consistently outperform all the other compared layers with both SVM and SSVM, which is in accordance with
the conclusion of prior work. We show some segmentation examples in Figure 3, from which we can see that SSVM based
CRF learning with CNN features produces segmentations close to the ground truth.

MSRC-21. The compared results with features of different layers are summarized in Table 3. Different from the
binary cases of Weizmann horse and Graz-02, features of the 7th layer perform the best, which may result from the
fact that MSRC is much more difficult due to its many categories. Figure 4 shows some qualitative results of SSVM
based CRF learning with different features, from which similar conclusions can be drawn.


4.3. State-of-the-art comparison 

Based on the above evaluation, we choose the best-performing 6th layer features for the binary datasets (Weizmann horse and
Graz-02) and the 7th layer features for the multi-class datasets (MSRC-21, Stanford Background and VOC 2011) to learn the CRF
and compare with state-of-the-art results in this section. For the three multi-class datasets, we add the results of
incorporating the mutex and co-occurrence pairwise potentials introduced in Section 3.3.
Intersection/union, average (foreground, background) (%)

Method     | bike              | car               | people
-----------+-------------------+-------------------+------------------
SVM  BoW   | 66.5 (50.4, 82.7) | 66.8 (42.2, 91.5) | 64.0 (41.9, 86.2)
SVM  UFL   | 69.7 (55.0, 84.5) | 73.1 (52.7, 93.4) | 61.4 (37.2, 85.6)
SVM  L5    | 74.6 (62.4, 86.8) | 76.0 (58.4, 93.7) | 65.9 (47.0, 84.9)
SVM  L6    | 77.7 (66.7, 88.6) | 78.1 (61.8, 94.5) | 68.9 (51.1, 86.6)
SVM  L7    | 77.1 (66.0, 88.2) | 77.6 (60.8, 94.3) | 68.4 (50.5, 86.3)
SSVM BoW   | 70.9 (56.6, 85.2) | 75.7 (57.2, 94.1) | 71.3 (53.5, 89.1)
SSVM UFL   | 74.2 (61.5, 86.9) | 77.9 (60.9, 94.9) | 70.9 (53.0, 88.8)
SSVM L5    | 81.6 (72.3, 90.8) | 84.5 (72.6, 96.4) | 75.4 (61.1, 89.7)
SSVM L6    | 82.0 (73.1, 91.0) | 85.6 (74.5, 96.6) | 79.6 (67.2, 92.1)
SSVM L7    | 81.7 (72.6, 90.8) | 85.1 (73.7, 96.5) | 76.0 (62.0, 90.0)

Pixel accuracy, average (foreground, background) (%)

Method     | bike              | car               | people
-----------+-------------------+-------------------+------------------
SVM  BoW   | 79.0 (67.9, 90.2) | 75.8 (55.2, 96.3) | 74.5 (55.4, 93.7)
SVM  UFL   | 81.7 (72.4, 91.1) | 80.9 (64.4, 97.4) | 71.2 (48.2, 94.3)
SVM  L5    | 86.3 (81.2, 91.4) | 86.3 (76.2, 96.4) | 80.9 (72.4, 89.4)
SVM  L6    | 88.4 (84.4, 92.5) | 87.2 (77.3, 97.0) | 83.0 (75.2, 90.8)
SVM  L7    | 88.2 (84.1, 92.2) | 86.6 (76.3, 97.0) | 82.8 (75.1, 90.5)
SSVM BoW   | 82.5 (73.5, 91.6) | 83.2 (68.9, 97.6) | 81.4 (68.2, 94.7)
SSVM UFL   | 85.4 (78.6, 92.1) | 83.8 (69.3, 98.4) | 81.5 (68.9, 94.2)
SSVM L5    | 91.0 (88.0, 93.9) | 90.6 (82.8, 98.3) | 88.8 (85.3, 92.3)
SSVM L6    | 91.6 (89.5, 93.7) | 91.4 (84.4, 98.4) | 90.0 (85.1, 94.8)
SSVM L7    | 91.3 (89.0, 93.6) | 91.2 (84.0, 98.4) | 89.3 (86.1, 92.4)

Table 2: Compared results of the average intersection-over-union score and average pixel accuracy on the Graz-02 dataset. We include the 
foreground and background results in the brackets. CNN features perform significantly better than the traditional BoW feature and the unsupervised 
feature learning, with features of the 6th layer performing the best among the compared layers in both SVM and SSVM. SSVM based CRF learning 
performs far better than SVM. 



Category  |             SVM              |             SSVM
          | BoW   UFL   L5    L6    L7   | BoW   UFL   L5    L6    L7
----------+------------------------------+------------------------------
building  | 61    57    77    78    80   | 65    70    71    71    71
grass     | 87    95    91    95    98   | 89    97    97    94    95
tree      | 60    77    86    88    89   | 87    87    92    93    92
cow       | 29    55    79    81    82   | 64    69    86    89    87
sheep     | 47    59    83    87    91   | 74    77    95    96    98
sky       | 83    96    95    95    96   | 90    98    98    96    97
aeroplane | 56    56    80    83    86   | 58    45    94    95    97
water     | 66    70    85    88    87   | 75    75    82    85    89
face      | 60    61    81    86    89   | 78    77    93    92    95
car       | 54    41    76    75    76   | 56    49    80    85    85
bicycle   | 66    67    84    86    86   | 85    86    95    95    96
flower    | 53    65    81    83    86   | 54    82    92    90    94
sign      | 68    31    52    55    58   | 55    26    76    71    75
bird      | 7     17    55    58    59   | 6     12    65    68    76
book      | 61    67    82    86    87   | 60    81    94    94    89
chair     | 33    30    64    69    68   | 14    40    72    77    84
road      | 51    75    83    85    87   | 66    79    89    92    88
cat       | 27    52    81    84    85   | 50    49    87    93    97
dog       | 35    26    63    67    67   | 35    14    71    75    77
body      | 19    32    68    72    74   | 38    47    78    81    87
boat      | 29    6     25    28    31   | 8     1     51    54    52
----------+------------------------------+------------------------------
Average   | 50.1  54.1  74.8  77.6  79.0 | 57.4  60.1  83.9  85.8  86.7
Global    | 62.7  69.5  82.1  84.9  86.0 | 70.7  76.1  86.9  87.3  88.5

Table 3: Segmentation results on the MSRC-21 dataset. We report the pixel-wise accuracy for each category as well as the average per-category 
scores and the global pixel-wise accuracy (%). Deep learning performs significantly better than the BoW feature and the unsupervised feature 
learning, with SSVM based CRF learning using features of the 7th layer of the deep CNN achieving the best results. SSVM based CRF learning 
performs far better than SVM. 


Binary datasets. Table 4 shows the compared segmentation results on the Weizmann horse and the Graz-02 datasets.
We use a different evaluation metric for comparison on the Graz-02 dataset, namely the F-score ($F = 2pr/(p + r)$,
where $p$ is the precision and $r$ is the recall) for each class and the average over classes. In both cases, our method
outperforms all the compared methods.
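The F-score defined above can be computed from flattened binary masks as in this small sketch. The function name and the convention of returning 0.0 when there are no true positives are assumptions for this example.

```python
def f_score(pred, truth, positive=1):
    """F = 2pr / (p + r), with precision p and recall r of the positive class."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    fp = sum(p == positive and t != positive for p, t in zip(pred, truth))
    fn = sum(p != positive and t == positive for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0  # assumed convention: no true positives gives F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two true positives, one false positive, one false negative.
pred = [1, 1, 1, 0, 0]
truth = [1, 1, 0, 1, 0]
score = f_score(pred, truth)
```

Unlike pixel accuracy, this score ignores true negatives, so large correctly labelled background areas do not inflate it.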

Multi-class datasets. The compared global and average per-category pixel accuracies on the MSRC-21 and the
Stanford Background datasets are summarized in Table 5. On the MSRC dataset, our method outperforms all the methods
except [35]. When mutex or co-occurrence pairwise potentials are incorporated in inference, we obtain further
improvements. As expected, the co-occurrence potentials outperform the mutex potentials. [35] performs slightly better
than ours in terms of global accuracy (they did not report average per-category accuracy), which may result from the
fact that they use a fully connected CRF while ours is not.
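The difference between the global and the average per-category accuracy can be seen in a short sketch. It assumes integer class labels and skips classes that do not occur in the ground truth when averaging; the function name is introduced here for illustration.

```python
def global_and_average_accuracy(pred, truth, num_classes):
    """Return (global pixel accuracy, mean of the per-class accuracies)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, t in zip(pred, truth):
        total[t] += 1
        correct[t] += int(p == t)
    global_acc = sum(correct) / sum(total)
    # Classes absent from the ground truth are skipped (an assumption here).
    per_class = [c / n for c, n in zip(correct, total) if n > 0]
    return global_acc, sum(per_class) / len(per_class)


# Class 0 dominates; the single error falls on the rare class 1.
truth = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 1]
global_acc, average_acc = global_and_average_accuracy(pred, truth, num_classes=2)
```

One error on a rare class barely moves the global figure but lowers the per-category average noticeably, which is why the two columns in the tables can rank methods differently.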

As for the Stanford Background dataset, we can see that our method performs better than [10] and outperforms
all the others. The work of [10] trains a 3-stage multiscale convolutional network on the training images while we


Method                    | S_a   | S_o
--------------------------+-------+------
Levin & Weiss [27]        | 95.5  | -
Cosegmentation [19]       | 80.1  | -
Bertelli et al. [5]       | 94.6  | 80.1
Kuettel et al. [21]       | 94.7  | -
Ours                      | 95.7  | 84.0

Method                    | bike  | car   | people | average
--------------------------+-------+-------+--------+--------
Marszalek & Schmid [31]   | 61.8  | 53.8  | 44.1   | 53.2
Fulkerson et al. [11]     | 66.4  | 54.7  | 51.4   | 57.5
Aldavert et al. [2]       | 71.9  | 62.9  | 58.6   | 64.5
Kuettel et al. [21]       | 63.2  | 74.8  | 66.4   | 68.1
Ours                      | 84.5  | 85.4  | 80.4   | 83.4

Table 4: State-of-the-art comparison of segmentation performance (%) on the Weizmann horse (left) and Graz-02 (right) datasets. 
Method                    | Global (%) | Average (%)
--------------------------+------------+------------
Shotton et al. [37]       | 72         | 67
Ladicky et al.            | 86         | 75
Munoz et al. [32]         | 78         | 71
Gonfaus et al. [14]       | 77         | 75
Lucchi et al. [30]        | 73         | 70
Yao et al. [39]           | 86.2       | 79.3
Lucchi et al. [29]        | 83.7       | 78.9
Ladicky et al. [23]       | 87         | 77
Roy et al. [35]           | 91.5       | -
Ours                      | 88.5       | 86.7
Ours (mutex)              | 90.3       | 89.2
Ours (co-occur)           | 91.1       | 90.5

Method                    | Global (%) | Average (%)
--------------------------+------------+------------
Gould et al. [15]         | 76.4       | -
Munoz et al. [32]         | 76.9       | 66.2
Lempitsky et al. [26]     | 81.9       | 72.4
Farabet et al. [10]       | 81.4       | 76.0
Roy et al. [35]           | 81.1       | -
Ours                      | 82.6       | 76.2
Ours (mutex)              | 82.6       | 76.3
Ours (co-occur)           | 83.5       | 76.9

Table 5: State-of-the-art comparison of global and average per-category pixel accuracy on the MSRC-21 (left) and the Stanford Background (right) 
datasets. 


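VOC 2011 val | Ours  | Ours (mutex) | Ours (co-occur)
-------------+-------+--------------+----------------
bg           | 78.3  | 79.8         | 81.5
aero         | 43.9  | 53.1         | 55.7
bike         | 20.4  | 23.8         | 23.6
bird         | 23.2  | 26.4         | 24.0
boat         | 22.7  | 28.8         | 27.7
bottle       | 24.6  | 28.6         | 27.3
bus          | 42.2  | 51.6         | 52.8
car          | 41.0  | 48.2         | 54.1
cat          | 36.1  | 37.8         | 37.1
chair        | 12.6  | 13.1         | 14.9
cow          | 24.9  | 29.7         | 37.1
table        | 19.8  | 22.3         | 28.6
dog          | 25.0  | 28.4         | 22.9
horse        | 23.8  | 29.6         | 33.1
mbike        | 38.6  | 45.2         | 49.7
person       | 53.3  | 52.7         | 54.2
plant        | 20.0  | 21.0         | 27.4
sheep        | 36.6  | 46.2         | 49.3
sofa         | 20.2  | 20.9         | 22.3
train        | 38.1  | 46.2         | 49.3
tv           | 24.6  | 29.6         | 30.9
-------------+-------+--------------+----------------
mean         | 31.9  | 36.3         | 38.3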

Table 6: Results of per-category and mean segmentation accuracy (%) on the PASCAL VOC 2011 validation dataset. Best results are bold faced. 


directly transfer the deep CNN trained on ImageNet, sparing the effort of network training. Adding mutex
potentials to our method does not bring any performance boost. On further investigation, we found that this is because
there are only eight categories (one of which is the ambiguous foreground category) in this dataset, so that
the only mutex information obtained is that grass, water and road cannot appear above sky. In contrast, our
co-occurrence potentials perform much better, leading to a further performance boost. We show some segmentation
examples in Figure 5.

The segmentation results on the PASCAL VOC 2011 validation dataset are reported in Table 6. In [13], Girshick
et al. achieved an average accuracy of 47.9 by using augmented training data and an extra annotation set. Here we did
not use any extra dataset, only the VOC training set. By introducing mutex or co-occurrence pairwise potentials,
consistent improvements are observed on most of the categories. As expected, our co-occurrence potential again
outperforms the mutex potential. In Table 7 we compare with the recent work of Carreira et al. [6], which performed
evaluations with the same settings as ours (using the train/val set). Our method achieves the same accuracy as [6].
Note that the dimension of the feature descriptors used in [6] is in the tens of thousands (33589), while ours is 4096.
Qualitative examples and some failure cases are shown in Figure 6 and Figure 7.

5. Conclusion 

We propose to learn CRF using SSVM based on features learned from a pre-trained deep convolutional neural 
network for image segmentation. The deep CNN is trained on ImageNet and proved to perform exceptionally well 
when transferred to object segmentation. We learn the CRF in the large margin framework by SSVM, and then conduct 
inference with co-occurrence pairwise potentials incorporated. Extensive experimental evaluations on the Weizmann 
horse, Graz-02, MSRC-21, Stanford Background and the PASCAL VOC 2011 dataset demonstrate the advantages of 
our method and provide new baselines for further research. 
Method                | Mean (%)
----------------------+---------
HOG [6]               | 14.1
SIFT-PCA-FISHER [6]   |
O2P [6]               |

Table 7: Comparison of the mean segmentation accuracy (%) on the PASCAL VOC 2011 validation dataset.
Figure 3: Segmentation examples on the Graz-02 dataset. 1st row: Test images; 2nd row: Ground truth; 3rd row: Segmentation results produced
by SSVM based CRF learning with bag-of-words feature; 4th row: Segmentation results produced by SSVM based CRF learning with unsupervised 
feature learning; 5th row: Segmentation results produced by SSVM based CRF learning with the 6th layer CNN features. 



Figure 4: Segmentation examples on MSRC. 1st row: Test images; 2nd row: Ground truth; 3rd row: Segmentation results produced by SSVM based 
CRF learning with bag-of-words feature; 4th row: Segmentation results produced by SSVM based CRF learning with unsupervised feature learning; 
5th row: Segmentation results produced by our method with co-occurrence pairwise potentials. 



Figure 5: Segmentation examples on the Stanford Background dataset. 1st row: Test images; 2nd row: Ground truth; 3rd row: Segmentation results 
produced by our method with co-occurrence pairwise potentials. 
Figure 6: Segmentation examples on the VOC 2011 dataset. 1st row: Test images; 2nd row: Ground truth; 3rd row: Segmentation results produced
by our method with co-occurrence pairwise potentials. 



Figure 7: Failure examples on the VOC 2011 dataset. 1st row: Test images; 2nd row: Ground truth; 3rd row: Segmentation results produced by 
our method with co-occurrence pairwise potentials. 